ABSTRACT


This thesis asserts that Cluster Analysis, or Numerical 
Taxonomy, has many potential applications in the field of 
international relations. It demonstrates two representative 
applications. Both examples treat the nations of the world 
as objects having measurable attributes, and both examples 
use selected attributes to produce a dendrogram (or 
hierarchical classification) of the nations of the world. 

In one example this dendrogram is used to objectively group
the nations into blocs based on external economic ties. In
the other example the dendrogram is used to highlight
interactions among several attributes, ignoring the identity of
individual nations, the same way a scatter plot highlights
interactions between two variables.
I. BACKGROUND


A. CLUSTER ANALYSIS DEFINED 
The subject of this thesis is a group of mathematical
techniques known collectively as either Cluster Analysis or
Numerical Taxonomy. The terms are synonymous. Their formal
definition, paraphrased from Ref. 6, is actually a sequence
of definitions, as follows:

classification system - a set of subsets of a set of
objects which conveys some
information about the objects

taxonomy - the science of constructing classificatory
systems

Cluster Analysis or Numerical Taxonomy - the science
of constructing mathematical classificatory
systems

In less formal terms, Cluster Analysis includes all
mathematical methods of classifying objects into sets so as to
represent complex data in a simpler way which will serve as
a fruitful source of hypotheses.

Cluster Analysis is a two-stage process. The first stage
is to choose quantifiable attributes that describe the objects,
and then use these attributes to measure the pair-wise
dissimilarity among the objects. The second stage is to represent
these dissimilarities by an appropriate classificatory system
or display.

The input to Cluster Analysis is normally an nxm matrix 


of data, measurements of m attributes for each of n objects. 





The output from Cluster Analysis is normally one of three
displays:


A hierarchical classification, commonly called a tree 
diagram or dendrogram; 


A partition of the objects into mutually exclusive sets, 
each set described by a "profile" or vector of m average 
attribute values; 


A "clumping" of the objects into sets that may overlap, each 
set again described by a profile. 


The value of these outputs is that they summarize the original
data objectively and they tend to highlight subtle interactions
in the original data, enabling a user to formulate reasonable
hypotheses about these interactions.


B. PREVIOUS APPLICATIONS OF CLUSTER ANALYSIS

Cluster Analysis was developed in the eighteenth century
by botanists and biologists attempting to inject more
objectivity into their classifications of plant and animal specimens
(the familiar phylum-genus-species scheme). Subsequently the
same technique was used by geologists. Most recently, Cluster
Analysis has found numerous applications in the social sciences,
particularly in psychology. Reference 1 describes an
application that is representative.





II. PROPOSED APPLICATIONS OF CLUSTER ANALYSIS
IN THE STATE DEPARTMENT

The United States State Department is currently trying
to revitalize its policy-making and resource-allocating
functions, a la the Defense Department metamorphosis under
Robert McNamara. This revitalization effort has been
underway for eight years now. In that time there has been published
a plethora (References 2 and 3 are representative) of "master
plans" for the incorporation of Systems Analysis in the State
Department.

This paper does not propose another master plan, but
merely suggests that a single existing statistical analysis
technique has useful applications within the State Department.
The existing technique is Cluster Analysis, and the potential
applications within the State Department are described and
demonstrated in the pages that follow.


A. FIRST APPLICATION: TO HIGHLIGHT INTERACTIONS OF VARIABLES 
1. General Description
In this application, Cluster Analysis highlights the
interaction of several variables the same way a scatter plot
would for two variables. It inputs an n x m matrix of data
(n countries, each described by m variables) and outputs a
dendrogram. The dendrogram itself says nothing about
interactions among the variables. But it is a simple matter to
select a clustering level (where k = the number of clusters)
and plot the distribution of the m variables within each
cluster. Comparisons among these plots should bring out all
significant interactions among the variables. In particular,
they should highlight mutual interaction among three variables
or even among four variables just as easily as they highlight
a two-way interaction. This is a potential not shared by
factor analysis and regression techniques.
2. Scenario for Demonstration 

The United States Constitution lists Freedom of the
Press, Freedom of Speech, and Freedom of Assembly as
inalienable human rights. Although one might argue that these
precise terms have been eclipsed by communications technology,
most of the Western world would agree that "free and facile
communication among the people" is an essential quality in a
free and productive society. Having tentatively accepted
this thesis, a sociologist or political scientist might well
wish to dissect the concept of "free and facile communication
among the people," to define it in quantifiable terms.
Moreover, a policy planner in the State Department might well wish
to go one step further: to use this quantitative definition
in a comparative study of the countries of the world. Such
comparisons are made every day with respect to Gross National
Product, Life Expectancy, etc. Why not also tabulate a FFCAP
(free and facile communication among the people) Index?

Assuming that the State Department considered it 
worthwhile to develop such an index, they would probably task 


a team of their sociologists to propose a list of measurable 
factors that either contribute to or detract from "free and
facile communication among the people." This team would
certainly appreciate the practical advantages of building
this list around statistics that had already been measured,
and using the existing data to continually validate their
theories against the real world.

Thus they would probably be faced, early on in their
proceedings, with a large volume of existing data to be
perused, or analyzed in a very general sense. At this point
they could profit greatly from applying Cluster Analysis to
highlight the interaction of variables.

3. Choice of Data

To demonstrate this application, the author has usurped
the role of State Department sociologist and selected the
following statistics as "measurable factors that either
contribute to or detract from free and facile communication among
the people":

Variable 1. Concentration of Population in Cities, 1965
Variable 2. Radios per 1000 Population, 1965
Variable 3. Students in Higher Education (Third Level)
per One Million Population, 1965
Variable 4. Ethno-Linguistic Fractionalization
Variable 5. Press Freedom Index, 1965

See Appendix A for definitions of these variables.
It is readily admitted that this list is not as
complete as it should be. In particular, "Literacy Rate" is
conspicuous by its absence, and some measure of newspaper
circulation seems a necessary counterpart to Variable 2.
The reason for such omissions was unavailability of data,
an affliction that is widespread among independent researchers
but not shared by insiders at the State Department.

The unavailability of data to this researcher imposed
another limitation on this demonstration besides the omission
of some desirable variables. Table I displays values of the
aforementioned variables for only 85 of the 136 nations in the
world. It was necessary to delete the other nations because
of excessive missing data.

4. Choice of Dissimilarity Coefficient 

This section describes the process of converting the
data in Table I to a matrix of Dissimilarity Coefficients.

The first decision point was to specify a formula for
the Dissimilarity Coefficient (DC). The DC is a single real
number specifying the amount of dissimilarity between Country
A and Country B, obtained by somehow combining the five data
points describing each country. There are many different
formulas for transforming these ten data points to a single
DC. Cormack presents a concise but comprehensive summary of
all the common formulas in Table 1 of Ref. 4.

In the situation at hand it was decided to use a
Euclidean Distance, standardized by range. That is,

$$DC(i,j) = \sum_{v=1}^{5} W_v \,(X_{iv} - X_{jv})^2$$

where

$$W_v = \frac{1}{\max_{i,j}\,(X_{iv} - X_{jv})}$$

Euclidean Distance was preferred to others simply
because of its geometric, intuitive appeal.

On the other hand, a more elaborate rationale went
into the decision to standardize by range. First of all it
was decided that some type of standardizing (scaling) would
be appropriate. Most of the literature argues convincingly
that scaling is inappropriate when the difference in scale
between two variables may be intrinsic; but no such intrinsic
differences seemed likely in the five variables used here.
Moreover, using unstandardized Euclidean Distance in this
situation would certainly result in the DC being driven by
Variables 2 and 3 while Variables 1 and 4 would be virtually
ignored. And there is no a priori reason to intentionally
emphasize one variable over another in this application.

Having decided to use some type of scaling, there
were many types to choose from, namely scaling by standard
deviation, scaling by range, and scaling by some other
heterogeneity measure (see page 326 of Ref. 4 for a comparative
discussion). Since the data distributions were mixed there
was no compelling theoretical reason for choosing one scaling
method over another. Eventually, scaling by range was
selected for its simplicity. It would be interesting to see
if scaling by standard deviation would significantly change
the end result (dendrogram) from that obtained here; but this
was not done.

Using Euclidean Distance standardized by range, the 
425 data points in Table I were transformed into a matrix of 


3570 Dissimilarity Coefficients. 
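
The conversion from a data table to a matrix of Dissimilarity Coefficients can be sketched in a few lines of Python. This is an illustrative sketch only, not the program used for this thesis. It assumes that each difference is divided by the range of its variable before squaring and that a square root is taken at the end; the function names and the toy values are hypothetical.

```python
from itertools import combinations
from math import sqrt

def variable_ranges(data):
    """Range (max minus min) of each variable, taken over all objects."""
    return [max(col) - min(col) for col in zip(*data)]

def dissimilarity(x, y, ranges):
    """Euclidean distance between two objects after scaling by range.

    Assumption: the scaling is applied inside the square, so every
    variable contributes a value between 0 and 1 before summation.
    A variable with zero range would need special handling.
    """
    return sqrt(sum(((a - b) / r) ** 2 for a, b, r in zip(x, y, ranges)))

def dissimilarity_matrix(data):
    """Return {(i, j): DC} for every unordered pair of objects, i < j."""
    ranges = variable_ranges(data)
    return {(i, j): dissimilarity(data[i], data[j], ranges)
            for i, j in combinations(range(len(data)), 2)}

# Made-up values for three hypothetical countries and two variables.
toy = [[0.0, 10.0], [1.0, 30.0], [0.5, 20.0]]
dc = dissimilarity_matrix(toy)
```

With n objects the matrix holds n(n-1)/2 coefficients, which for 85 countries gives the 3570 values quoted above.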
5. Choice of Algorithm
There are several methods of proceeding from the matrix
of Dissimilarity Coefficients (DC) to a partition (or a
dendrogram of partitions) of the countries. Choosing one was the
next decision point in this demonstration. All methods in
general use fall into one of three categories:


a. Agglomerative Algorithms - a series of successive
fusions of the 85 countries into groups.

b. Divisive Algorithms - a partitioning of the complete set
of countries successively into finer partitions.

c. Reallocative Algorithms - successive reallocation of
individual countries between the sets of some
initial partition.

It was first decided that a reallocative algorithm would be
inappropriate because it requires an initial partition, and
there was no a priori evidence to suggest what that partition
should be. Between the two remaining alternatives, theoretical
considerations did not yield a preference: in nearly every
case, agglomerative and divisive algorithms produce identical
dendrograms. The agglomerative algorithm was selected because
its details have been more thoroughly documented in the
literature.
Within the family of agglomerative algorithms there 
are at least eight documented alternative "sorting strategies" 
or formulas for determining the DC between cluster (k) and 
cluster (ij), using the DC between cluster (k) and cluster (i) 
and the DC between cluster (k) and cluster (j). If the matrix 
of Dissimilarity Coefficients contains natural and compelling 


clusters, each having strong internal cohesion and strong 
external isolation, then the choice of sorting strategy is
not a critical one; but if natural and compelling clusters
are not present, different sorting strategies can produce
markedly different dendrograms. The eight common sorting
strategies are explained in Chapter 3 of Ref. 4. Of those
eight, the Complete Linkage-Furthest Neighbor sorting strategy
and the Single Linkage-Nearest Neighbor sorting strategy
represent the extremes. The others may be thought of as
compromises between these two. The Complete Linkage-Furthest
Neighbor sorting strategy can be expressed mathematically as

DC(k,ij) = max( DC(k,i), DC(k,j) )

It produces compact clusters having high internal cohesion;
but it may sacrifice external isolation when natural and
compelling clusters are not intrinsic in the data. At the
other extreme, the Single Linkage-Nearest Neighbor sorting
strategy can be expressed mathematically as

DC(k,ij) = min( DC(k,i), DC(k,j) )

It tends to produce chains of members instead of compact
clusters, especially when natural and compelling clusters
are not intrinsic in the data. In some
applications this tendency is desirable.

For the demonstration at hand compact clusters were
considered to be more desirable than chains, and the Complete
Linkage-Furthest Neighbor sorting strategy was selected. It
would be interesting to see if one of the compromise sorting
strategies, such as Group Average, would significantly change
the end result (dendrogram) from that obtained here; but this
was not done.

The dendrogram in Drawing 1 was obtained using the
Complete Linkage-Furthest Neighbor sorting strategy in an
agglomerative algorithm. The computer program is listed at
the end of this thesis for information. It should be noted
that the Dissimilarity Coefficients in the matrix were standardized
to the (0.0, 100.0) interval, using scaling by range, before
they were input to the clustering algorithm. But such
standardizing was made for computational convenience only.
Its single effect was a monotonic transformation of the
numerical scale across the top of the dendrogram. The shape
of the dendrogram was unaffected.
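
The agglomerative procedure and the two extreme sorting strategies can be illustrated with the short Python sketch below. It is not the program listed at the end of this thesis; the function name, the dictionary layout and the toy dissimilarities are hypothetical and chosen only to show how DC(k,ij) is derived from DC(k,i) and DC(k,j) at each fusion.

```python
def agglomerate(dc, n, strategy=max):
    """Fuse n objects step by step; return (level, members) per fusion.

    dc maps each pair (i, j) with i < j to its Dissimilarity Coefficient.
    strategy=max gives Complete Linkage-Furthest Neighbor,
    strategy=min gives Single Linkage-Nearest Neighbor.
    """
    d = {frozenset(pair): value for pair, value in dc.items()}
    members = {i: (i,) for i in range(n)}    # cluster id -> objects in it
    fusions = []
    next_id = n
    while len(members) > 1:
        closest = min(d, key=d.get)          # pair of clusters with smallest DC
        i, j = tuple(closest)
        fusions.append((d[closest], sorted(members[i] + members[j])))
        # DC(k, ij) from DC(k, i) and DC(k, j), for every other cluster k
        updated = {k: strategy(d[frozenset((k, i))], d[frozenset((k, j))])
                   for k in members if k not in (i, j)}
        d = {p: v for p, v in d.items() if i not in p and j not in p}
        members[next_id] = members.pop(i) + members.pop(j)
        for k, value in updated.items():
            d[frozenset((k, next_id))] = value
        next_id += 1
    return fusions

# Made-up dissimilarities for four hypothetical objects.
toy = {(0, 1): 1, (0, 2): 3, (0, 3): 7, (1, 2): 2, (1, 3): 6, (2, 3): 4}
complete = agglomerate(toy, 4, max)
single = agglomerate(toy, 4, min)
```

On this toy input both strategies fuse the objects in the same order, but the fusion levels differ: complete linkage delays each fusion until the furthest members are close enough, while single linkage fuses as soon as any one member is close.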

6. From Dendrogram to Cluster Profiles 

Having obtained the dendrogram in Drawing 1, and
recalling that the purpose here was to highlight the
interactions among variables, it remained only to select a level
of clustering, identify the partition of countries there,
plot the distributions of variables within each cluster, and
compare these plots. But several iterations of this process
were required before the interactions among variables began
to appear.

The first attempt was at level k = 3. Here the
United States appeared alone in one cluster, and the other
two clusters contained 41 countries and 43 countries
respectively. Apparently the United States was alone because
of its extreme values in Variables 2 and 3. At this point
the United States was evaluated as an outlier and was not
included in subsequent comparisons. Within each of the other
two clusters, the mean and standard deviation of each of the
five variables were computed. After a short perusal of these
20 statistics it became apparent that no interactions among
variables were highlighted at this level. Within four of the
five variables, the two means were displaced from each other
by less than the sum of their standard deviations.

For the second iteration, level k = 7 was chosen.
Again the pair-wise displacements between means were compared
to standard deviations. The standard deviations were definitely
smaller here than they had been at the k = 3 level: within-
cluster homogeneity had improved. Between-cluster
heterogeneity had improved to a lesser extent: many of the means
were well separated but several others were not.

For the third iteration, level k = 8 was chosen. Here
there were three clusters containing only one country each.
All three were dismissed as outliers, leaving five clusters
for further study. Within each of these five clusters, the
mean and standard deviation of each of the five variables
were computed, using a straightforward computer program.
These statistics are listed in Table II and displayed
graphically in Table III. After a relatively brief perusal of
Table III several possible interactions among the variables
came to mind. Then a quick double-check of Table II confirmed
that four of those possible interactions were probable
interactions. These probable interactions are listed in
Table IV.
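
The computation of cluster profiles, and the comparison of mean displacement against the sum of standard deviations used in the iterations above, can be sketched as follows. This is an illustrative Python sketch, not the program used here; the function names and sample values are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def cluster_profiles(data, labels):
    """Return {cluster label: [(mean, sd) for each variable]}."""
    groups = {}
    for row, label in zip(data, labels):
        groups.setdefault(label, []).append(row)
    return {label: [(mean(col), pstdev(col)) for col in zip(*rows)]
            for label, rows in groups.items()}

def well_separated(profile_a, profile_b):
    """Per variable: True when the two means are displaced from each
    other by more than the sum of the two standard deviations."""
    return [abs(ma - mb) > sa + sb
            for (ma, sa), (mb, sb) in zip(profile_a, profile_b)]

# Made-up values: four hypothetical countries, two variables, two clusters.
data = [[1, 10], [3, 12], [11, 11], [13, 13]]
labels = ["A", "A", "B", "B"]
profiles = cluster_profiles(data, labels)
separation = well_separated(profiles["A"], profiles["B"])
```

In this toy case the first variable separates the two clusters and the second does not, which is the kind of contrast the perusal of Tables II and III was looking for.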

None of these "probable interactions" was verified
mathematically. The first one would have been relatively
easy to check out, by computing 10 correlation coefficients.
But the others would have required considerably more ingenuity.
Since the purpose here was to demonstrate a new application of
cluster analysis rather than to deduce substantive results,
mathematical verification was considered beyond the scope of
this thesis.

However, there was a further step, within the scope
of this thesis, that might have been pursued but was not. It
would have been logical to proceed next to another cluster
level (perhaps k = 13) and again look for interactions. Such
reiterations might well confirm or refine the interactions
already deduced, and highlight additional interactions as well.


B. SECOND APPLICATION: TO CLASSIFY COUNTRIES OBJECTIVELY
1. General Description

In this application, the user presumes to understand
the variables used, and the interactions among these variables,
at least on a superficial level. The purpose here is not to
research the variables, but rather to objectively classify the
countries. The previous application (to highlight interactions
among variables) produced a dendrogram only as an intermediate
step before producing "cluster profiles" as the final product.
But the current application seeks only the dendrogram itself,
TABLE II CLUSTER PROFILES OUTPUT FROM HIGHLIGHTING DEMONSTRATION
[Graphical display of the cluster profiles, one axis per variable: Variable 1, Concentration of Population in Cities (min 0.0, max 0.21); Variable 2, Radios per 1000 Population (max 1233.5); Variable 3, Students in Higher Education per Million Population (min 6.0); Variable 4, Ethno-Linguistic Fractionalization (min 0.0, max 0.926); Variable 5, Press Freedom Index.]
TABLE IV

RESULTS OF HIGHLIGHTING DEMONSTRATION:
PROBABLE INTERACTIONS AMONG VARIABLES
DEDUCED AT CLUSTER LEVEL k = 8

1. There appear to be no significant pair-wise correlations
(either positive or negative) among the five variables.

2. A high value in Variable 3 tends to be accompanied by a
high value in Variable 5. But the inverse and converse
are not true (i.e., a high value in Variable 5 does not
imply a high value in Variable 3, and a low value in
Variable 3 does not imply a low value in Variable 5).

3. A very high value in Variable 4 tends to be accompanied
by a low value in Variables 1, 2, and 3.

4. The combination of a high value in Variable 1 and a low
value in Variable 4 tends to be accompanied by a high
value in Variable 5.

For ready reference, the variable names are:

Variable 1. Concentration of Population in Cities, 1965
Variable 2. Radios per 1000 Population, 1965
Variable 3. Students in Higher Education (Third Level)
per One Million Population, 1965
Variable 4. Ethno-Linguistic Fractionalization
Variable 5. Press Freedom Index, 1965
to objectively confirm, or perhaps modify, the user's previous,
subjective classifications.
2. Scenario for Demonstration

People commonly think about the countries of the world
as members of clusters. They use labels like "the Western
World", "the Communist Bloc", "the Have's" and "the Have-not's"
every day, and they frequently hear more esoteric terms like
"tri-polar world", "five-polar world" and "spheres of influence",
all of which have classificatory overtones.

No doubt such classifications are convenient and useful;
but as they exist now, many are also subjective and confusing.
When two speakers discuss the behavior of "the Communist Bloc"
without first enumerating the members of that bloc, they may
disagree violently until they discover that one of them includes
Cuba and Chile in his definition but excludes Yugoslavia, while
the other does the reverse. If our classifications of
countries are useful but subjective, it would seem desirable
to make them more objective.

Imagine that a political scientist in the State
Department wished to inject some objectivity into the terms 
"Western World", "Communist Bloc", "Soviet Bloc", etc. His 
first step would probably be to identify the several theories 
(form of government, internal economic system, external 
political ties, external economic ties, etc.) that are commonly 
used to define the terms in question. Then he would probably 
select one of these theories for quantification and search out 
measurable factors (preferably statistics that had already 


been measured) with which to express it. 
For example, imagine that he selected the theory that
external economic ties are the prime mover in the concept of
bloc membership. Then his search for measurable factors would
certainly lead to statistics such as level of foreign aid
received from every other country, value of imports received
from every other country, and value of exports sent to every
other country, each of these statistics prorated against the
host country's GNP and/or population.

The final step for our State Department researcher
would be to combine these measurable factors mathematically
so as to output a bloc membership label for each country of
the world; that is, he would write a "factors-to-bloc
transformation". If he were not acquainted with Cluster Analysis,
he might well try to write a single function of the form

$$\text{bloc membership} = F(aid_{US},\ aid_{USSR},\ \ldots,\ trade_{US},\ trade_{USSR},\ trade_{JAPAN},\ trade_{COM.\,MKT.},\ \ldots,\ GNP,\ \text{population},\ \text{etc.})$$

where "bloc membership" is a discrete variable which can take
on three or perhaps five predetermined values. But such a
function would probably be crippled by two weaknesses:
excessive complexity and theoretical inadequacy. The reader can
certainly visualize how complicated such a function would have
to be in order to have broad applicability. Moreover, no
matter how complex the function, it would necessarily ignore
an obvious fact about blocs of countries: two countries can
be closely bound in a bloc not by economic dependency on each
other, but by their simultaneous economic dependency on an
intermediate country.

Cluster Analysis has far more potential as a
"factors-to-bloc" transformation. It does not share the dual weaknesses
of the functional transformation. First of all, the ability
to link two countries together through an intermediary is
intrinsic to every clustering algorithm (so long as the
Complete Linkage-Furthest Neighbor sorting strategy is not used).
And secondly, Cluster Analysis requires that the user define
only a transformation from measurable factors to a pair-wise
Dissimilarity Coefficient rather than a transformation from
measurable factors to bloc membership. Surely the former
should be less complex than the latter.

3. Choice of Data

To demonstrate this application, the author again
usurped the role of State Department political scientist and
selected the following statistics as measurable factors with
which to objectively classify countries into blocs:

Variable 1. Gross National Product per Capita, 1965
Variable 2. Trade as Percentage of Gross National Product,
1965
Variable 3. Soviet Aid per Capita, 1954 - 1965
Variable 4. U.S. Economic Aid per Capita, 1958 - 1965

Values of these variables for each of 85 countries are listed
in Table V.
TABLE V - DATA INPUT TO OBJECTIVE CLASSIFICATION DEMONSTRATION
Here again the unavailability of data made the
demonstration artificial. As was asserted in the preceding
section, classification of countries by external economic ties
should depend primarily on pair-wise data. But this researcher
did not have access to any standardized, comprehensive
pair-wise data. Variables 3 and 4 above are pair-wise but not
comprehensive. Foreign aid is provided in substantial amounts
by countries other than the United States and the Soviet Union.
But this researcher could not locate any but the most
piecemeal data on other donors. Variable 2 above is not pair-wise
at all. Pair-wise trade data is collected by the
International Monetary Fund, and their data is both standardized
and reasonably comprehensive. But that data is not made
available to the public in comprehensive form. Without
pair-wise trade data it is virtually impossible to construct
a logical theory for blocking countries by external economic
ties. Nevertheless, this demonstration was carried through
to completion because its purpose is not to deduce substantive
results but to demonstrate a procedure. And that procedure
can still be demonstrated using the foreign aid data (Variables
3 and 4) which is pair-wise, although incomplete.

Footnote 1: After completion of the research described here, the
author did obtain access to the IMF data and began a Cluster
Analysis on it. But the results were not obtained in time
to incorporate them in this thesis. See Section II.B.7 for
a description of the work in progress. The data was obtained
through the Inter-University Consortium for Political Research,
on computer tape. The reason why the data is not generally
available was obvious: its sheer magnitude. For purposes of
data collection, the IMF defines 207 countries, and 207
countries taken two at a time produce 21,321 trading
combinations. The complete data file contains almost 500,000 numbers.
4. Choice of Dissimilarity Coefficient

This section describes the process of converting the
data in Table V to a matrix of Dissimilarity Coefficients.

When Cluster Analysis was used to highlight the
interactions of variables (Section II.A.4 above), the choice of
DC was motivated by a desire to have all five variables
weighted equally, to prevent the user's preconceptions from
affecting the results. Precisely the opposite is true here.
Here the author presumed that he already knew how the
variables interact. He wanted to incorporate that knowledge
into the DC. The DC was constructed using the following
rationale:

First of all, it was decided that the DC between a
foreign aid donor and any other country should be inversely
related to the level of that foreign aid. Thus, for a first
cut, the formulas

$$DC(US,i) = \frac{1}{1 + usaid_i} \quad \text{and} \quad DC(SOV,i) = \frac{1}{1 + sovaid_i}$$

were considered. Next it was observed that 27 of the 85
countries received foreign aid from both the United States
and the Soviet Union. To incorporate relative dependency
into the formulas, it was decided to insert a ratio of aid
levels. Hence the following formulas were considered:

$$DC(US,i) = \frac{1 + sovaid_i}{1 + usaid_i} \quad \text{and} \quad DC(SOV,i) = \frac{1 + usaid_i}{1 + sovaid_i}$$

These formulas seemed reasonable except that the same level
of aid has less impact on a rich country than it
does on a poor country. So it was decided to insert GNP_i as
a scaling factor wherever an aid term appeared in either
formula. But this insertion tended to greatly reduce the
size of the aid terms with respect to the "1" terms.
Therefore the "1" terms were arbitrarily reduced to "0.1",
producing

$$DC(US,i) = \frac{0.1 + \dfrac{sovaid_i}{GNP_i}}{0.1 + \dfrac{usaid_i}{GNP_i}} \quad \text{and} \quad DC(SOV,i) = \frac{0.1 + \dfrac{usaid_i}{GNP_i}}{0.1 + \dfrac{sovaid_i}{GNP_i}}$$

The fact that DC(US,i) and DC(SOV,i) are reciprocals and the
fact that they are dimensionless had intuitive appeal. The
only apparent shortcomings were the two imposed by
unavailability of data: aid from other countries is ignored, and
pair-wise trade is ignored. Although a total trade figure
was available, there seemed to be no logical way to substitute
it for the missing pair-wise figure.
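
The final donor formulas can be checked numerically with the Python sketch below. It reads them as (0.1 + sovaid_i/GNP_i) / (0.1 + usaid_i/GNP_i) for DC(US,i) and the reciprocal for DC(SOV,i); the function names, the `floor` parameter name and the sample figures are hypothetical and not taken from Table V.

```python
def dc_us(usaid, sovaid, gnp, floor=0.1):
    """DC between the United States and country i.

    usaid, sovaid and gnp are per-capita figures for country i, so the
    ratios aid/gnp are dimensionless. floor is the arbitrary 0.1 term.
    """
    return (floor + sovaid / gnp) / (floor + usaid / gnp)

def dc_sov(usaid, sovaid, gnp, floor=0.1):
    """DC between the Soviet Union and country i (reciprocal of dc_us)."""
    return (floor + usaid / gnp) / (floor + sovaid / gnp)

# Made-up figures: a country receiving only U.S. aid, and one receiving none.
us_client = (dc_us(20.0, 0.0, 100.0), dc_sov(20.0, 0.0, 100.0))
no_aid = (dc_us(0.0, 0.0, 100.0), dc_sov(0.0, 0.0, 100.0))
```

A country receiving aid only from the United States comes out closer to the United States (DC below 1) and further from the Soviet Union (DC above 1), while a country receiving no aid sits at DC = 1 from both donors.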

At this point it was verified that the DC(US,i)
formula would apply to every country dyad in which the United
States is a member, except for the United States - Soviet Union
dyad. Similarly the DC(SOV,i) formula applies to every
country dyad in which the Soviet Union is a member, except
for the United States - Soviet Union dyad. Thus it remained
to construct formulas for the United States - Soviet Union
dyad and for all dyads in which neither the United States nor
the Soviet Union is a member. Hopefully the same formula
would apply to both.
But here the lack of pair-wise trade data was really
crippling. The only pair-wise economic ties of any
significance involved pair-wise trade. The only logical formula
necessarily involved the inverse of pair-wise trade. There
seemed no natural way to use the available data on total
trade. Finally, in desperation it was rationalized that a
country whose foreign trade is large with respect to its GNP
tends to have closer ties with another country in the same
situation. It was decided that the nucleus of the formula
should be

$$DC(i,j) = |\,trade_i - trade_j\,|$$



But because of the weak theory here as compared to the
rigorous formulas for DC(US,i) and DC(SOV,i), it was decided
to diminish the effect of trade difference when foreign aid
recipients are involved. Hence it was decided to expand the
formula to

[Equation illegible in source: DC(i,j) expressed in terms of |trade_i - trade_j|, sovaid_i + usaid_i, and sovaid_j + usaid_j]
The lack of theoretical foundation here is admitted. It
casts suspicion on the values in the DC matrix and on the
dendrogram finally obtained. But the reader is again reminded
that the purpose here is to demonstrate the procedure, not to
deduce substantive results.

Turning from the substance to the procedure, there is
a significant departure from normal Cluster Analysis procedure,
taken above, that warrants explanation. Two completely
different DC formulas have been developed, one to be used when the
United States or Soviet Union is a dyad member and another
to be used the rest of the time. The mathematical significance
of this duality is that the DC formulas, taken collectively,
produce gross violations of the metric inequality, which is

$$DC(a,b) + DC(b,c) \geq DC(a,c) \quad \text{for every } a, b, c.$$



Generally, it is desirable although not essential that a matrix
of DC's satisfy the metric inequality. When it does not, the
clustering algorithm can be expected to produce high
"distortion" between the matrix of DC's and the dendrogram.
(Loosely defined, "distortion" is the difference between
DC(i,j) and the level at which country i and country j cluster
together in an agglomerative algorithm.) But of what
significance is high distortion? The word carries derogatory
connotations, but is distortion really undesirable in Cluster
Analysis? This author maintains that it depends on the purpose
of the clustering. In the "highlighting of variables"
application, distortion was not desirable: figuratively speaking,
each country had been plotted in five-dimensional space and
the clustering algorithm was searching for natural clusters,
as plotted. But in this "objective classifying of countries"
application, distortion is natural: the original pair-wise
similarities specified in the DC matrix cannot be expected
to be representable in Euclidean space, and during the
clustering it is desired that these original similarities be
affected by intermediate countries. With this reasoning, it
is asserted that violation of the metric inequality is
necessary and thus the use of two or more DC formulas is acceptable.
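
The metric inequality discussed above can be checked mechanically for any matrix of DC's. The following Python sketch is illustrative only: the function name and the three-object example matrix are assumptions introduced here, not data or code from this thesis.

```python
from itertools import permutations

def metric_violations(dc, tol=1e-9):
    """Return all triples (a, b, c) with DC(a,b) + DC(b,c) < DC(a,c)."""
    n = len(dc)
    return [(a, b, c)
            for a, b, c in permutations(range(n), 3)
            if dc[a][b] + dc[b][c] < dc[a][c] - tol]

# Hypothetical 3-object DC matrix: objects 0 and 2 are far apart (10.0)
# even though both are close to object 1, so the inequality fails.
example_dc = [
    [0.0, 1.0, 10.0],
    [1.0, 0.0, 2.0],
    [10.0, 2.0, 0.0],
]
violations = metric_violations(example_dc)
```

A matrix built from two unrelated DC formulas, as here, would be expected to return many such triples; an empty list indicates that the matrix satisfies the inequality.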
Using the formulas developed above, the 340 data points
in Table V were transformed to a matrix of Dissimilarity
Coefficients.


5. Choice of Algorithm


Here, as in the highlighting demonstration, the first
decision point was to choose among the agglomerative, divisive
and reallocative algorithms. Again the divisive algorithms
were discarded because they are not as well documented as
their agglomerative counterparts. The reallocative algorithms
did not apply because they require that the DC be a metric.
Hence the agglomerative algorithm was selected.

The final decision was to select a sorting strategy.
The Complete Linkage-Furthest Neighbor strategy was eliminated
from consideration here; it does not permit any chaining,
which is desirable in this application. On the other end of
the spectrum, the Single Linkage-Nearest Neighbor sorting
strategy maximizes chaining, often to the extent that natural
clusters are obscured. For this application it was
determined to use one of the compromise sorting strategies. Among
these, Group Average sorting seemed to correspond with the
concepts of bloc membership, ratios of foreign aid, etc. It
is expressed mathematically as


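    DC(k, i+j) = [n_i / (n_i + n_j)] DC(k,i) + [n_j / (n_i + n_j)] DC(k,j)

where n_i and n_j are the numbers of countries in clusters i and j.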

The dendrogram in Drawing 2 was obtained using the
Group Average sorting strategy in an agglomerative algorithm.
But once again the reader is cautioned that the results are
suspect.
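
For readers who want to see the mechanics, an agglomerative algorithm with Group Average sorting can be sketched in Python as below. This is a minimal illustration, not the thesis program: it assumes the usual reading of Group Average, in which the dissimilarity from a cluster to a newly merged cluster is the size-weighted average of its dissimilarities to the two parts. The function name and the four-object matrix are hypothetical.

```python
def group_average_clustering(dc):
    """Agglomerative clustering with the Group Average sorting strategy.

    dc is a symmetric matrix (list of lists) of dissimilarity coefficients.
    Returns the merge history as (members_a, members_b, level) tuples.
    """
    # Active clusters: id -> tuple of member indices.
    clusters = {i: (i,) for i in range(len(dc))}
    # Current dissimilarities between active clusters, keyed by id pair.
    dist = {(i, j): dc[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(dc)
    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest dissimilarity.
        (a, b), level = min(dist.items(), key=lambda kv: kv[1])
        na, nb = len(clusters[a]), len(clusters[b])
        merges.append((clusters[a], clusters[b], level))
        new_dist = {}
        for e in clusters:
            if e in (a, b):
                continue
            d_ea = dist[(min(e, a), max(e, a))]
            d_eb = dist[(min(e, b), max(e, b))]
            # Size-weighted average of the two old dissimilarities.
            new_dist[(e, next_id)] = (na * d_ea + nb * d_eb) / (na + nb)
        # Drop entries involving the merged clusters, then add the new ones.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges

# Hypothetical 4-object DC matrix, for illustration only.
example_dc = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]
merges = group_average_clustering(example_dc)
```

In this example objects 0 and 1 merge at level 2, objects 2 and 3 at level 4, and the two pairs join at 7.5, which is the plain average of the four cross-pair dissimilarities.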

6. Explanation of Dendrogram 

Despite the admitted artificiality of the results 
obtained here, the layman might appreciate an explanation of 
the information available in any dendrogram produced by 
Cluster Analysis.

The key to reading a dendrogram is the concept of
"cluster level." Upon specifying a cluster level, the
following information can be read from the dendrogram: the
number of clusters and the countries contained in each cluster.
That is, there is a correspondence from cluster level to a
partition of the countries.

The scale at the bottom of Drawing 2 is a cluster
level scale. Note that the minimum value of cluster level is
0.0 at the far left and the maximum value is 20.0 at the far
right. A low cluster level specifies a partition having many
small clusters, while a high cluster level specifies a
partition having a few large clusters. Thus cluster level can be
thought of as a measure of the largest dissimilarity (or,
equivalently, the weakest bond) present within any cluster in
the partition.

For example, consider cluster level 0.0, the minimum
observed cluster level in Drawing 2. At cluster level 0.0,
the 85 countries are partitioned into 77 clusters. Seventy-two
of these 77 contain only a single country. Four of the 77
contain exactly two countries. And one cluster contains 5
countries: Canada, Ireland, Switzerland, Sweden and Denmark.

Since 0.0 is the minimum observed cluster level, we may conclude
that the strongest possible bonds exist within every cluster.
Specifically, we may conclude that Canada, Ireland, Switzerland,
Sweden and Denmark are bound together by the tightest possible
economic ties. Our mathematical model will not separate them
even at the lowest cluster level.

Consider next a slightly higher cluster level, say 1.3.
Here we are permitting slightly weaker bonds to be present
within clusters. We find that the 85 countries are here
partitioned into 52 clusters. Thirty-three of those clusters
contain a single country, twelve contain exactly two countries,
five contain exactly three countries, one contains four
countries and one contains nine countries. In the nine-country
cluster, Canada, Ireland, Switzerland, Sweden and Denmark have
been joined by New Zealand, South Africa, France and Australia.
We may conclude that slightly weaker economic ties bind the
four new countries to the original five.

Similar inferences can be drawn from any dendrogram
produced by Cluster Analysis.
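
The correspondence from cluster level to partition can also be made concrete in code. The Python sketch below assumes a merge history given as (members, members, level) tuples; that layout, the function name and the four-object example are all hypothetical and are not taken from Drawing 2.

```python
def partition_at_level(n_objects, merges, cluster_level):
    """Return the clusters (sorted lists of object indices) at a cluster level."""
    # parent[] implements a simple union-find over the objects.
    parent = list(range(n_objects))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Apply every merge that occurred at or below the requested level.
    for members_a, members_b, level in merges:
        if level <= cluster_level:
            parent[find(members_a[0])] = find(members_b[0])

    groups = {}
    for obj in range(n_objects):
        groups.setdefault(find(obj), []).append(obj)
    return sorted(groups.values())

# Hypothetical merge history for four objects: (members, members, level).
example_merges = [((0,), (1,), 0.0), ((2,), (3,), 1.3), ((0, 1), (2, 3), 7.5)]
low = partition_at_level(4, example_merges, 0.0)
mid = partition_at_level(4, example_merges, 1.3)
high = partition_at_level(4, example_merges, 20.0)
```

Raising the level from 0.0 to 1.3 to 20.0 takes this toy example from three clusters to two and then to one, mirroring the way the partitions of the 85 countries coarsen as the cluster level rises.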
7. Work in Progress


Throughout this second demonstration of Cluster
Analysis it has been emphasized that the unavailability of
pair-wise trade data made the demonstration artificial. But
this artificiality can soon be removed. Pair-wise trade data
was recently provided to this author through the
Inter-University Consortium for Political Research [Ref. 9]. Time
will not permit this author to complete a Cluster Analysis
of the data, but if another researcher chooses to undertake
it, the following plan of attack is suggested.
a. Step 1 - Reduce data file to manageable size

The ICPR data file contains approximately 333,720
pair-wise trade data: annual trade values, in millions of
U.S. dollars, for the years 1958 through 1968, among 207 different
"countries." Many of these "countries" are actually
colonies and many such have negligible foreign trade except
with a single "sponsor country." The logical first step is
to selectively reduce the size of the data file by eliminating
the insignificant "countries", and by selecting a single year
and eliminating the other nine. It is recommended that all
"countries" be eliminated except the 136 nations having a
population of one million or more and those smaller nations
having membership in the United Nations as of 1968. These
136 nations are listed on pages 1 through 4 of Ref. 8. It is
further recommended that the year 1967 be used and the rest
be eliminated temporarily. The author has determined that,
through the first one-sixth of the file, 1967 has fewer zero
entries than any other year (a zero entry signifies either
trade less than 100,000 dollars or missing data). This
selective reduction of the data should reduce the file length
to about one eighth its original length.
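
Step 1 amounts to a single sequential pass over the file. The Python sketch below shows the idea under loud assumptions: the record layout (`TradeRecord`), the country codes and the trade values are invented for illustration, since the actual ICPR file format is not described here.

```python
from typing import NamedTuple

class TradeRecord(NamedTuple):
    # Hypothetical record layout; the actual ICPR file format is not given here.
    year: int
    exporter: str
    importer: str
    reporter: str
    value: float  # millions of U.S. dollars

def reduce_file(records, keep_countries, keep_year):
    """Sequentially keep records for one year among the retained countries."""
    for rec in records:
        if (rec.year == keep_year
                and rec.exporter in keep_countries
                and rec.importer in keep_countries):
            yield rec

# Made-up records, for illustration only.
sample = [
    TradeRecord(1967, "USA", "CAN", "USA", 7000.0),
    TradeRecord(1966, "USA", "CAN", "USA", 6500.0),  # wrong year
    TradeRecord(1967, "USA", "XXX", "USA", 1.0),     # eliminated "country"
]
reduced = list(reduce_file(sample, {"USA", "CAN"}, 1967))
```

Because the filter is a generator, it reads the input once from beginning to end, which is the access pattern that suits a tape-resident file.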
b. Step 2 - Sort and combine data 

Preparatory to sorting the data, the reduced data
file should be stored on either a disk or a data cell rather
than magnetic tape. The ICPR normally provides the data on
tape, and tape is a satisfactory input to the data reduction
process in step 1 because that process can be sequential,
reading the file once from beginning to end. However, the
sorting process about to be described cannot read the file
sequentially, and magnetic tape is a very inefficient input
to processes that must search the data.

The ICPR data file does not list one trade figure
per country dyad per year. It lists up to four figures,
namely,

    1. Value of exports from i to j, as reported by i
    2. Value of exports from i to j, as reported by j
    3. Value of exports from j to i, as reported by j
    4. Value of exports from j to i, as reported by i


Hopefully numbers 1 and 2 are approximately equal and numbers
3 and 4 are approximately equal. If so, then total trade
between i and j is the sum of 1 and 3. It is recommended
that this approximate equality be assumed for the initial run
of this "sort and combine" process. Then the process is
simple: search the file for the first record involving the
i-j dyad; identify it with respect to direction of trade,
regardless of reporting country; continue searching for the
second record involving the i-j dyad; identify it with respect
to direction; if the directions are opposite then sum the two
values and store the result; if the directions are the same then
ignore the second value and continue searching for the third
record; and so on. The reason why shortcuts are in order for
the initial run is that this "sort and combine" process will
have to be performed 9180 times (136 countries, taken two at
a time, yields 9180 different combinations).
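
The "sort and combine" shortcut can be sketched in Python as follows. Note one deliberate departure, stated as an assumption: instead of searching the file once per dyad, the sketch makes a single pass and keeps a dictionary in memory, which presumes the reduced file fits there. The tuple layout and the sample values are hypothetical.

```python
def combine_dyads(records):
    """Total trade per unordered dyad from (exporter, importer, reporter, value) tuples.

    For each direction of trade only the first record encountered is used,
    regardless of reporting country; the two directions are then summed.
    """
    first_seen = {}  # (exporter, importer) -> value of first record found
    for exporter, importer, _reporter, value in records:
        first_seen.setdefault((exporter, importer), value)

    totals = {}  # frozenset({i, j}) -> total trade between i and j
    for (exporter, importer), value in first_seen.items():
        dyad = frozenset((exporter, importer))
        totals[dyad] = totals.get(dyad, 0.0) + value
    return totals

# Made-up records, for illustration only.
sample = [
    ("A", "B", "A", 10.0),  # exports A -> B, reported by A
    ("A", "B", "B", 11.0),  # same direction, reported by B: ignored
    ("B", "A", "B", 4.0),   # exports B -> A, reported by B
]
totals = combine_dyads(sample)
```

Here the second record is ignored because its direction repeats the first, and the total for the A-B dyad is the sum of one figure per direction.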
c. Step 3 - Choose a Dissimilarity Coefficient
The following formula is recommended as a DC, at
least initially:

    DC(i,j) = 1 / trade_ij


More elaborate formulas can be developed later by incorporating 
the rationale in Section II.B.4 of this thesis. 
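
As a rough illustration of Step 3, the Python sketch below builds a DC matrix on the assumption that the recommended coefficient is the inverse of total pair-wise trade, in line with the earlier remark that the only logical formula involves the inverse of pair-wise trade. The treatment of dyads with no recorded trade (infinite dissimilarity) is an assumption of this sketch, as are the function name and the input layout.

```python
import math

def inverse_trade_dc(countries, total_trade):
    """Build a symmetric DC matrix with DC(i,j) = 1 / trade_ij.

    total_trade maps frozenset({i, j}) to total trade between i and j.
    """
    n = len(countries)
    dc = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            trade = total_trade.get(frozenset((countries[a], countries[b])), 0.0)
            # Assumption: a dyad with no recorded trade is maximally dissimilar.
            value = 1.0 / trade if trade > 0 else math.inf
            dc[a][b] = dc[b][a] = value
    return dc

# Made-up input: only the A-B dyad has recorded trade.
example = inverse_trade_dc(["A", "B", "C"], {frozenset(("A", "B")): 4.0})
```

Any clustering routine applied to such a matrix must be prepared for the infinite entries; substituting a large finite constant is an equally defensible choice.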
d. Step 4 - Choose a Clustering Algorithm 
It is recommended that an agglomerative algorithm
with Group Average sorting strategy be used, for the same
reasons that it was selected in Section II.B.5 above. This
involves making the following additions and substitutions in
the computer program listed at the end of this thesis:


Immediately before the statement

      DO [?] E=1,N

insert the following two statements:

      RATA = S(A,A)/(S(A,A)+S(B,B))
      RATB = S(B,B)/(S(A,A)+S(B,B))

In place of

      DS(E) = AMAX1(S(E,A),S(E,B))

substitute

      DS(E) = RATA*S(E,A) + RATB*S(E,B)

Similarly, in place of

   70 DS(E) = AMAX1(S(A,E),S(B,E))

substitute

   70 DS(E) = RATA*S(A,E) + RATB*S(B,E)

And finally, in place of

 7[?] DS(E) = AMAX1(S(E,A),S(B,E))

substitute

 7[?] DS(E) = RATA*S(E,A) + RATB*S(B,E)
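
The effect of these substitutions can be seen by comparing the two update rules side by side. The Python sketch below is an illustration, not a translation of the thesis program: it assumes, from the form of RATA and RATB, that S(A,A) and S(B,B) hold the sizes of clusters A and B, and the function names are hypothetical.

```python
def complete_link_update(d_ea, d_eb):
    """Complete Linkage: dissimilarity to the merged cluster is the larger value."""
    return max(d_ea, d_eb)

def group_average_update(d_ea, d_eb, size_a, size_b):
    """Group Average: size-weighted mean of the two dissimilarities."""
    rat_a = size_a / (size_a + size_b)  # plays the role of RATA
    rat_b = size_b / (size_a + size_b)  # plays the role of RATB
    return rat_a * d_ea + rat_b * d_eb

# Cluster E is 2.0 from a 3-member cluster A and 6.0 from a 1-member cluster B.
complete = complete_link_update(2.0, 6.0)
average = group_average_update(2.0, 6.0, 3, 1)
```

With these inputs the Complete Linkage rule reports 6.0 while the Group Average rule reports 3.0, since three of the four members of the merged cluster lie at distance 2.0.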
III. CONCLUSION


This thesis has demonstrated two potential uses of Cluster
Analysis in which the nations of the world are treated as
measurable objects. The substantive results obtained in each
demonstration are not presented as revelations: they were
derived incidentally while demonstrating methods. It is
asserted that the two uses illustrated here, markedly
different in several respects, are representative of a wide range of
applications for Cluster Analysis in the fields of political
science and international relations. Although Cluster Analysis
was developed for the physical sciences and has so far received
scant attention outside that context, it is readily adaptable
to the social sciences. In particular, it is extremely well
suited to model building and statistical analysis involving
the nations of the world. As such, it warrants the attention
of the U.S. State Department.
APPENDIX A


DATA 


Except for three data points, all data used in this thesis
were made available by the Inter-University Consortium for
Political Research. The data were originally collected by
Charles Lewis Taylor and Michael C. Hudson. Neither the
original collectors of the data nor the consortium bear any
responsibility for the analysis or interpretations presented
here.

Following are the precise definitions of the nine
variables used in this thesis. All definitions are extracted
verbatim from Ref. 8.


Variable name: Concentration of Population in Cities, 1965
Definition: Concentration is defined as: the sum over all
cities of the squares of the proportion of the total
population residing in each city. Concentration is higher the fewer
cities and the greater the size of the largest city relative
to the total population. [Ref. 8, p. 16]
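
The concentration definition translates directly into a one-line computation. The Python sketch below is illustrative; the function name and the population figures are invented.

```python
def concentration(city_populations, total_population):
    """Sum over all cities of the squared share of the total population."""
    return sum((p / total_population) ** 2 for p in city_populations)

# Hypothetical country of 1000 people: two cities versus one dominant city.
two_cities = concentration([500, 300], 1000)
one_city = concentration([800], 1000)
```

The single dominant city yields the higher value, consistent with the statement that concentration is higher the fewer the cities and the larger the largest city.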


Variable name: Radios per 1000 Population, 1965
Definition: Figures relate to all types of receivers including
those connected to a re-distribution system. They relate
either to the number of licenses issued or sets declared or
to the estimated number of receivers in use. In many countries
a license may cover more than one receiver in the same
household. Data exclude television sets. [Ref. 8, p. 32]
Variable name: Students in Higher Education (Third Level)
per One Million Population, 1965.
Definition: Data refer to the enrollment in all institutions
of education at the third level, i.e., degree granting and
non-degree granting institutions of both private and public
higher education of all types. These include universities,
higher technical schools, teacher training schools, theological
schools, etc. As far as possible part time students are
included in the figures but correspondence courses and auditors
are generally excluded. [Ref. 8, p. 41]


Variable name: Ethno-Linguistic Fractionalization
Definition: The main source for this variable (Atlas Narodov
Mira) makes little distinction between ethnic and linguistic
differences in its definition and collection of data. Groups
are determined not by their physical characteristics but by
their roles, their descent, and their relationships to others.
An index of fractionalization calculated upon data from Atlas
does correlate highly with a similar index calculated upon
linguistic data from other sources, but not quite highly
enough to be considered the same indicator. Other sources
used here report only linguistic data. Index of
fractionalization was calculated by the following formula:

    F = 1 - SUM over i of (N_i / N) ((N_i - 1) / (N - 1))

where N_i = number of people in the ith group
and N = total population [Ref. 8, p. 46]
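
A small Python sketch may help in reading the fractionalization index. It assumes the formula is F = 1 - SUM (N_i / N)((N_i - 1)/(N - 1)), which is how the definition above is understood here; the function name and the group sizes are hypothetical.

```python
def fractionalization(group_sizes):
    """F = 1 - sum_i (N_i / N) * ((N_i - 1) / (N - 1))."""
    total = sum(group_sizes)
    return 1.0 - sum((n / total) * ((n - 1) / (total - 1)) for n in group_sizes)

homogeneous = fractionalization([10])          # one group holds everyone
split = fractionalization([5, 5])              # two equal groups
fragmented = fractionalization([1, 1, 1, 1])   # every person is a separate group
```

Under this reading F is 0 for a completely homogeneous population and 1 when no two people share a group, so it can be interpreted as the chance that two people drawn without replacement belong to different groups.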







Variable name: Press Freedom Index, 1965
Definition: This index, created by the School of Journalism,
University of Missouri, is "designed to measure the
independence of a nation's broadcasting and press system and its
ability to criticize its own local and national governments."
The index is comprised of the judgements of panels of native
and foreign newsmen on 23 aspects of the press (e.g., extent
of legal controls, licensing, government ownership, criticism
and censorship). For a fuller description, see Ralph L.
Lowenstein, "PICA (Press Independence and Critical Ability)
Index: Measuring World Press Freedom," University of Missouri,
School of Journalism Freedom of Information Center Publication
#166 (August, 1966). The index, which consists of averages
of the judges' scores, has a range from -4.00 for less freedom
to +4.00 for more. [Ref. 8, p. 116]


Variable name: Gross National Product per Capita, 1965
Definition: This variable was derived by dividing Gross
National Product in millions of U.S. dollars by total
population in thousands. Gross National Product is reported in
constant U.S. dollars and refers to gross national product
even for countries which normally report their national
accounts in terms of net material product or other concepts.
[Ref. 8, page number illegible]


Variable name: Trade as percentage of Gross National Product,
1965.
Definition: This variable was derived by dividing total trade
(imports plus exports, merchandise only) by Gross National
Product. [Ref. 8, p. 69]


Variable name: Soviet Aid per Capita, 1954 - 1965
Definition: This variable was derived by dividing total
Soviet aid by total population. Total Soviet aid data refer
to Soviet economic credits and grants to countries in terms
of thousand U.S. dollars for the period 1954/5 - 1965.
[Ref. 8, page number illegible]


Variable name: U.S. Economic Aid per Capita, 1958 - 1965
Definition: This variable was derived by dividing total
U.S. economic aid by total population. Total U.S. economic
aid data refer to grants and loans and are given in millions
of U.S. dollars for the period July 1, 1958 through June 30,
1965. [Ref. 8, p. 107]


The three data points not provided by the ICPR are listed 
below. The ICPR data file listed all three as missing data. 
But in each case this author preferred to introduce an approxi- 
mate (or even erroneous) value rather than eliminate the 
particular country from the Cluster Analysis. Hence the three 
values were estimated in the manner specified. Note that no 
two estimations involved the same country. All countries 
missing two or more data (among the nine variables used) in 
the ICPR data file were omitted from the Cluster Analysis at 


the outset. 
Country: Chile

Variable name: Radios per 1000 Population, 1965
Estimated value: 240.0

Method of estimation: Average of values for Peru and
Argentina.


Country: Chad

Variable: Students in Higher Education (Third Level) per
One Million Population, 1965

Estimated value: 230.0

Method of estimation: Average of values for Mali, Upper Volta,
Sudan and Cameroon.


Country: Zambia

Variable: Students in Higher Education (Third Level) per 
One Million Population, 1965 

Estimated value: 170.0

Method of estimation: Average of seventeen neighboring 


countries. 
COMPUTER PROGRAM


Clustering by Agglomerative Algorithm with Complete Link Sorting Strategy 
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